We inspect the geometry of proteins by identifying their backbones as framed 
polygons. We find that the left-handed helix region of the Ramachandran map 
for non-glycyl residues corresponds to an isolated and highly localized sector 
in the orientation of the Cβ carbons, when viewed in a Frenet frame that is 
centered at the corresponding Cα carbons. We show that this localization in 
the orientation persists to the Cγ and Cδ carbons. Furthermore, when we extend 
our analysis to the neighboring residues we conclude that the left-handed helix 
region reflects a very regular and apparently residue independent collective 
interplay of at least seven consecutive amino acids. 



I: INTRODUCTION 
Asparagine (ASN) is the predominant non-glycyl residue in the left-handed helix (L-α) region of 
the Ramachandran map of folded proteins [1], [2]. According to a prevailing view this reflects the 
presence of a localized but non-covalent attractive carbonyl-carbonyl interaction between the 
side-chain and backbone [3], [1], [5]. Such a carbonyl-carbonyl interaction can only be present in ASN, 
aspartic acid (ASP), glutamine (GLN) and glutamic acid (GLU). Indeed, the propensity of ASP, 
which is structurally very similar to ASN, is clearly amplified in the L-α region, while the somewhat 
lower propensity of GLN and GLU can be explained as a consequence of steric suppression [3]. 
Consequently one may suspect that when located in the L-α region, ASN and ASP residues should 
give rise to atypical fold geometries. 

ASN and ASP are also subject, more frequently than any other amino acid, to in vivo 
post-translational modifications, including spontaneous nonenzymatic deamidation from ASN to ASP 
[6] and racemization from L-ASP into D-ASP [7]. Since these processes are presumed to have 
consequences for cellular and organismal ageing [6], [9], and they might also enhance the emergence of 
amyloid based neurodegenerative diseases [8], [9], there are several good reasons to search for those 
patterns in folded proteins that appear to set these two residues apart from the rest. 

In this article we apply recently developed visualization techniques [10] to analyze protein 
conformations in the L-α region. We are particularly interested in the common aspects of the ASN and 
ASP residues in this region. But in lieu of the Ramachandran map, which is topologically a torus that 
has been projected onto the plane and as such is subject to discontinuities, our approach exploits 
the visually more engaging two-sphere. For this we interpret a folded protein as a piecewise linear 
framed chain with vertices located at the Cα carbons [10]. The framing can be introduced in various 
different ways, and examples include the geometric Frenet frame [11], [12], the geodesic Bishop frame 
[13], and the protein specific carbon frame that we obtain by utilizing the direction of the Cβ 
carbon along a protein backbone to construct an orthonormal framing [10]. This concept of framed 
chains is widely employed, for example, in aircraft and robot kinematics, stereo reconstruction and 
virtual reality [11], [12]. In those applications different framings correspond to different camera gaze 
positions, which one introduces for the purpose of extracting diverse and complementary information 
on various geometric aspects and structural properties of the system under investigation. However, 
thus far the leeway offered by deploying different frames has been only sparsely applied in 
analyzing protein conformations. Here we propose that the freedom in the choice of frames provides 
a powerful and pristine tool for capturing universal aspects in the geometry of folded proteins. 

METHODS 

A: Framing 

The framing of a piecewise linear chain can be introduced using the Denavit-Hartenberg [14] 
formalism, which was originally developed in robotics but has subsequently been applied extensively 
in other disciplines. Here we resort to a variant that has been elaborated in [10]. It utilizes the transfer 
matrix formalism [15] to describe a protein with N residues using the coordinates r_i of the backbone 
Cα carbons (i = 1, ..., N). The coordinates can be downloaded from the Protein Data Bank (PDB) 
[16]. For each of the segments that connect the backbone Cα carbons we compute the unit length 
tangent vector t_i, binormal vector b_i and normal vector n_i using 

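$$
\mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{|\mathbf{r}_{i+1} - \mathbf{r}_i|}, \qquad
\mathbf{b}_i = \frac{\mathbf{t}_{i-1} \times \mathbf{t}_i}{|\mathbf{t}_{i-1} \times \mathbf{t}_i|}, \qquad
\mathbf{n}_i = \mathbf{b}_i \times \mathbf{t}_i
$$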



The right-handed triplet (n_i, b_i, t_i) constitutes the orthonormal Discrete Frenet frame (DF frame) 
for each residue along the backbone chain, with base at the position of the vertex r_i. At each vertex 
i, a general orthonormal frame (e_1, e_2) on the normal plane to t_i can be obtained by rotating the 
DF frame around the tangent vector, 

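$$
\begin{pmatrix} \mathbf{n} \\ \mathbf{b} \\ \mathbf{t} \end{pmatrix}_i
\;\to\;
\begin{pmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \mathbf{t} \end{pmatrix}_i
=
\begin{pmatrix}
\cos\Delta_i & \sin\Delta_i & 0 \\
-\sin\Delta_i & \cos\Delta_i & 0 \\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mathbf{n} \\ \mathbf{b} \\ \mathbf{t} \end{pmatrix}_i
\qquad (1)
$$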

The parameters Δ_i specify the new frame at vertex r_i. Since a change in Δ_i is a frame rotation 
around t_i it has no effect on the geometry of the curve. It only rotates the local frame (e_1, e_2) at 
the i-th vertex around t_i. If we choose all Δ_i = 0 we get the DF frame at each vertex, while for 
non-vanishing choices of Δ_i we obtain alternative frames. In [10] it has been shown that the generic 
set of frames is subject to the following generalized DF equation 

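$$
\begin{pmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \mathbf{t} \end{pmatrix}_{i+1}
= \mathcal{R}_{i+1,i}
\begin{pmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \\ \mathbf{t} \end{pmatrix}_{i}
\qquad (2)
$$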

Here the transfer matrix that determines the frame at site i + 1 in terms of the frame at site i is 

$$
\mathcal{R}_{i+1,i} = \exp\{-\kappa_{i+1,i}\,(T^2 \cos\Delta_{i+1} - T^1 \sin\Delta_{i+1})\}
$$
The T^a (a = 1, 2, 3) are the adjoint SO(3) Lie algebra generators, and κ_{i+1,i} and τ_{i+1,i} are the 
geometrically determined bond and torsion angles, defined as shown in Figure 1. Note that according 
to (2), the bond and torsion angles are link variables. In particular, the definition of the bond angle 
involves three vertices while the definition of the torsion angle involves a total of four vertices. 

FIG. 1: Definition of the bond (κ) and torsion (τ) angles in terms of the backbone Cα carbons. 
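The frame construction above can be illustrated with a short numerical sketch. It is only a sketch under stated assumptions: the function names are hypothetical, the Cα coordinates are taken as an (N, 3) NumPy array (here a synthetic helix rather than PDB data), and the sign convention chosen for the torsion angle (the sign of (b_i × b_{i+1}) · t_i) is an assumption that is not spelled out in the text.

```python
import numpy as np


def unit(v):
    """Normalize vectors along the last axis (fails for zero-length vectors)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def discrete_frenet_frames(ca):
    """Tangents, binormals and normals for an (N, 3) array of C-alpha positions.

    t[k] belongs to the link k -> k+1. A binormal also needs the previous
    tangent, so b[k] and n[k] belong to the tangent t[k+1].
    """
    ca = np.asarray(ca, dtype=float)
    t = unit(ca[1:] - ca[:-1])          # t_i = (r_{i+1} - r_i) / |r_{i+1} - r_i|
    b = unit(np.cross(t[:-1], t[1:]))   # b_i = (t_{i-1} x t_i) / |t_{i-1} x t_i|
    n = np.cross(b, t[1:])              # n_i = b_i x t_i
    return t, b, n


def bond_angles(t):
    """Angle between consecutive tangents; each value uses three vertices."""
    cos_kappa = np.sum(t[1:] * t[:-1], axis=1)
    return np.arccos(np.clip(cos_kappa, -1.0, 1.0))


def torsion_angles(t, b):
    """Signed angle between consecutive binormals; each value uses four vertices.

    Assumed sign convention: sign of (b_i x b_{i+1}) . t_i.
    """
    cos_tau = np.sum(b[1:] * b[:-1], axis=1)
    sin_tau = np.sum(t[1:-1] * np.cross(b[:-1], b[1:]), axis=1)
    return np.arctan2(sin_tau, cos_tau)


def rotate_frame(n, b, delta):
    """Rotation (1): turn (n, b) by the angle delta around t to get (e1, e2)."""
    e1 = np.cos(delta) * n + np.sin(delta) * b
    e2 = -np.sin(delta) * n + np.cos(delta) * b
    return e1, e2


# Synthetic right-handed helix standing in for a C-alpha trace.
s = np.arange(8)
helix = np.column_stack([np.cos(s), np.sin(s), 0.5 * s])
t, b, n = discrete_frenet_frames(helix)
kappa = bond_angles(t)
tau = torsion_angles(t, b)
e1, e2 = rotate_frame(n, b, 0.3)
```

For a uniform helix all bond angles coincide and all torsion angles share one sign, and the rotated pair (e1, e2) stays orthogonal to the unchanged tangent, which is the statement that Δ_i does not affect the geometry of the curve.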

We have employed the generalized DF equation (2) to inspect the structure of folded proteins. As 
a data set we have utilized all those proteins that are presently in PDB with an overall resolution 
better than 2.0 Å. There is no additional curation or data pruning. As a control set we 
have used the highly curated version v3.3 Library of chopped PDB files for representative CATH 
domains [17]. Since the conclusions we draw from these two data sets are parallel, we only describe 
here the results for the first one in detail. 

B: Backbone Visualization 

We start by describing the visualization of the (frame independent) directions of the tangent vectors 
t_i along the backbone. For this we note that with the base of t_i at the location of the corresponding 
Cα carbon, its tip determines a point on the surface of a unit two-sphere. The location of this 
point is described by the bond angle as the latitude, and the torsion angle as the longitude of the 
two-sphere. Since the t_i are frame independent, the traces of their tips on the two-sphere provide 
frame independent information on the backbone geometry. For visualization of the two-sphere, we 
can stereographically project it from the south pole (κ_i = π) onto the two dimensional plane. 
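The projection just described can be sketched as follows. The function name is hypothetical, and the choice of the equatorial plane as the image plane (which gives the radius tan(κ/2)) is an assumption; the text only fixes that the projection is from the south pole, so that the north pole maps to the center and the south pole to infinity.

```python
import numpy as np


def stereographic_from_south(kappa, tau):
    """Project the sphere point (latitude kappa, longitude tau) onto the plane.

    kappa is measured from the north pole, so kappa = 0 maps to the origin
    and kappa -> pi (the south pole) runs off to infinity. The image plane is
    assumed to be the equatorial plane, which gives radius tan(kappa / 2).
    """
    r = np.tan(np.asarray(kappa, dtype=float) / 2.0)
    return r * np.cos(tau), r * np.sin(tau)


x_eq, y_eq = stereographic_from_south(np.pi / 2, 0.0)   # point on the equator
x_np, y_np = stereographic_from_south(0.0, 1.0)         # north pole
```

With this choice the equator κ = π/2 lands on the unit circle, so an annulus bounded by κ = π/2 on the sphere has outer radius 1 in the plane.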

This leads us to the angular distributions that we have displayed in Figure 2. For each of the four 
classes in this Figure we have used the prevalent PDB identification of the ensuing structures. 




FIG. 2: The four major protein structures: a) α-helices, b) β-strands, c) 3/10-helices and d) loops according to PDB classification 
in our data set. In each Figure the center of the annulus is the north pole of the two-sphere where the bond (latitude) angle 
κ = 0, so that the two consecutive unit tangent vectors t_i and t_{i+1} are parallel. The bond angle then measures distance from 
the center of the annulus, so that the south pole where κ = π corresponds to points at infinity on the plane, where t_i and t_{i+1} 
become anti-parallel. The torsion i.e. longitude angle τ ∈ [−π, π] increases by 2π when we go around the center of the annulus 
in counter-clockwise direction. The color coding in all our Figures increases from white to blue to green to yellow to red and 
describes the relative number of conformations in PDB in a log-squared scale. 

The four maps in Figure 2 portray all the essential features of the Ramachandran map. But there 
are also important differences that make it profitable to utilize these maps both in lieu of and in 
combination with the Ramachandran map. The predominant feature in Figure 2 is that the PDB 
data is concentrated in an annulus which is located roughly in the range κ ∈ (1, π/2). The exterior 
of the annulus (roughly κ > π/2) is an excluded region that describes conformations with steric 
clashes. The interior (roughly κ < 1) is sterically allowed but excluded as long as proteins remain in 
the collapsed phase. The interior region becomes occupied when we cross the Θ-point and proteins 
assume their unfolded conformations. The loops also appear to have a slightly higher tendency to 
bend towards the left, i.e. τ < 0. Note that in the Figures for α, β and 3/10 the blue regions correspond 
to residues whose present PDB classification is not consistent with our computed (κ, τ) values. 
Such issues have also been raised in [2]. 

Moreover, Figure 2 reveals that the PDB data displays hints of various underlying reflection 
symmetries. In Figure 2d (loops) there is a clearly visible mirror of the standard right-handed 
α-helix region, located in the vicinity of the outer rim with κ ≈ 1.5 and with torsion angle close to 
the value τ ≈ −2π/3. A helix in this regime would be left-handed and tighter than the standard 
α-helix. There is also a clear mirror structure in Figure 2b for β-strands: the standard region is 
(κ, τ) ≈ (1, π) and its less populated mirror is located around (κ, τ) ≈ (1, 0). The mirror symmetry 
between the ensuing extended regions persists in Figure 2d for loops. 

Finally, in Figure 2d we observe a small elevated (yellow) region in the vicinity of (κ, τ) ≈ 
(1.5, −π/3). This is the region of helices that are spatial left-handed mirror images of the standard 
α-helices. There is also a (slightly) elevated (green) mirror of this region around (κ, τ) ≈ (1.5, 2π/3). 
This is like the (κ, τ) ≈ (1.5, −2π/3) mirror of the standard right-handed α-helices. 

C: Side-Chain Visualization 

In Figure 3 (top) we display the orientations of the Cβ carbons in the Frenet frames of the Cα 
carbons, i.e. all Δ_i = 0 in (1). Recall that a Cβ carbon is present in all non-glycyl residues. In 
this Figure the Cα carbons are at the center of the sphere. Consequently Figure 3 (top) shows 
directly the locations of the Cβ carbons as seen by an observer who traverses the Cα backbone with 
a gaze orientation that is determined by the DF frame. We find it remarkable that in this frame the 
directions of the Cβ carbons are subject to only very small nutations. The additional feature is the 
presence of a highly localized, isolated island. Indeed, we have found that for non-glycyl residues 
this isolated island coincides with the L-α region of the Ramachandran map. This is shown in Figure 
3 (middle) where we display the direction of the Cβ carbons solely for those residues that are in the 
L-α Ramachandran region. Finally, in Figure 3 (bottom) we display the DF frame distribution 
of the Cβ carbons for those ASN that are located in loops only, according to PDB classification. The 
relatively high propensity of ASN in the L-α island is prominent. In the sequel we shall concentrate 
our attention solely on the isolated L-α island in Figure 3 (middle). 
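The "observer" picture used above amounts to expressing the unit vector from a Cα to a side-chain atom in the local frame and reading off two angles on the sphere. The following is a minimal sketch of that step; the function name is hypothetical, and the choice of the tangent t as the north pole with the longitude measured from the n axis is an assumption made for illustration.

```python
import numpy as np


def direction_in_frame(origin, atom, n, b, t):
    """Direction from `origin` (e.g. a C-alpha) to `atom` (e.g. its C-beta)
    as seen in the orthonormal frame (n, b, t).

    Returns the components along (n, b, t) together with the latitude
    (measured from the t axis, assumed to be the north pole) and the
    longitude (measured from the n axis towards b).
    """
    d = np.asarray(atom, dtype=float) - np.asarray(origin, dtype=float)
    d = d / np.linalg.norm(d)
    comps = np.array([d @ n, d @ b, d @ t])
    latitude = np.arccos(np.clip(comps[2], -1.0, 1.0))
    longitude = np.arctan2(comps[1], comps[0])
    return comps, latitude, longitude


# Toy frame aligned with the coordinate axes; the positions are made up.
n_ax, b_ax, t_ax = np.eye(3)
comps, lat, lon = direction_in_frame([0, 0, 0], [1.0, 1.0, 0.0], n_ax, b_ax, t_ax)
```

Collecting the (latitude, longitude) pairs over all residues and histogramming them on the sphere would give a density map of the kind shown in Figure 3; the same routine applies unchanged to other side-chain atoms.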

Figure 4 describes the propensity of different amino acids in the L-α island of Figure 3. This Figure 
confirms the high propensity of ASN (N) that is already visible in Figure 3 (bottom). We find that 
ASP (D) also has a relatively high propensity. But the propensity of histidine (H) is practically equal in 
our data. Furthermore, several non-carbonylic amino acids have a higher propensity than GLU (E). 
The β-branched isoleucine (I), valine (V) and threonine (T) all have clearly suppressed propensities 
and proline (P) is practically absent, presumably reflecting the presence of steric constraints [3], [2]. 

FIG. 3: The DF frame directions of the Cβ carbons. On top, all residues in our data set including ASN. In the middle we 
display only the L-α region of the Ramachandran map. On the bottom we display only those ASN that are in a loop in PDB 
classification. The color coding is proportional to the ensuing propensity. 




[Figure 4: two bar charts; horizontal-axis residue order (one-letter codes), left panel: N D H K Q R C S V E F M A W L T V I P; right panel: N D H K Q R C Y E M A F S L W T V I P] 

FIG. 4: The propensity of non-glycyl residues in the L-α island of Figure 3. On the left we display the result for all amino 
acids in our entire data set, and on the right for those in our data set that are classified as loops in PDB. The propensity of 
carbonylic ASN (N) is clearly enhanced in both cases. But in both cases the similarly carbonylic ASP (D) has about the same 
propensity as the non-carbonylic HIS (H), and the carbonylic GLU (E) is relatively quite suppressed. 



We have also analyzed the directions of the Cγ carbons, as seen by a Cα observer in the DF frame. 
The result presented in Figure 5 reveals that at the level of Cγ the data points in the single L-α 
island of Figure 3 remain highly localized, even though the island now becomes divided into two separate 
islands. There is a putative gauche+ (g+) island where around 70% of the residues in the L-α island 
are located, and a putative trans island for the rest. We do not see any putative gauche− region. 
The amino acid propensities of these two islands are displayed in Figure 6. ASN is the most populous 
in both Cγ islands. However, the propensity of ASP is elevated only in the trans island. In the g+ 
island both non-carbonylic HIS (H) and LYS (K) and even the carbonylic GLN (Q) have a higher 
propensity than ASP. 




FIG. 5: The DF frame directions of Cγ carbons for residues located in the L-α island. 

FIG. 6: The propensity of different amino acids in the putative g+ island (left) and trans island (right). 



In Figure 7 we plot the percentage of different amino acids as they appear in our data set in the 
two Cγ islands. Note that around 43% of residues in the putative g+ island are non-carbonylic, 
while in the putative trans island the number is close to 12%. 




FIG. 7: The relative number of different amino acids in the putative g+ (left) and trans (right) Cγ islands. 



In Figure 8 we plot the Cδ carbons in the DF frame of the Cα carbons. Since ASN (and ASP) has 
no Cδ carbon, we display instead the direction of the side-chain O atom for ASN, and the result is 
shown in Figure 9. 




FIG. 8: The directions of the Cδ carbons in the DF frame of the Cα carbons on the two-sphere (left), and on the stereographically 
projected Cβ frame (right). 

From Figure 8 we observe that the directions of the Cδ carbons continue to be highly localized, 
independently of the type of amino acid. We also observe the formation of a second but relatively weakly 
occupied island at larger values of the latitude angle and with longitude angle ≈ −2π/3, clearly 
visible in Figure 8 (right). At the moment we do not have a basis to conclude whether this is a real 
effect or simply a reflection of problems in the experimental data. 

In Figure 9 we display the O atoms of the ASN side-chain according to PDB identification in 
the DF frame of the Cα carbons. We note that the two islands appear to become divided into 
four distinct but still highly localized islands. However, it is well known that the identification 
between the ASN side-chain O and N atoms can be very difficult [18], and thus we have displayed in 
Figure 9 (bottom) the N atoms according to PDB identification as well. By comparing Figures 
9 (middle) and (bottom) we conclude that most likely the two inner-most islands denoted a and b 
in Figure 9 describe N instead of O atoms. 

RESULTS AND DISCUSSION 

Previously it has been proposed that in the case of ASN and ASP, the L-α Ramachandran region 
is stabilized by a non-covalent attractive interaction between the side-chain and backbone carbonyls 
[3], [5]. Here we find that in the DF frame and independently of the amino acid, the ensuing 
side-chain Cβ carbons all point in the same direction that we have denoted L-α in Figure 3. We have 
shown that this region also coincides with the non-glycyl L-α region of the Ramachandran 
map. Furthermore, the strong localization in the direction continues to persist when we extend our 
analysis to the Cγ and Cδ carbons of the L-α residues; the results are displayed in Figures 5 and 8, 
respectively. In the case of ASN (ASP), where there is no Cδ carbon, we utilize the side-chain O and 
N atoms instead, and as shown in Figure 9 the results are very similar. This strong localization in 
the directions, and in particular its apparent residue independence, suggests that the presence of the 
L-α island is associated with some relatively residue independent structural property of the protein 
backbone. 

FIG. 9: On top, the directions of the side-chain O atoms of ASN in the DF frame of the Cα carbons. In the middle, the same 
as on top but on stereographic projection and in the Cβ frame. Bottom, the directions of the side-chain N atoms of ASN in 
the same frame as in the middle, suggesting that the correct identification of the a and b regions in the top and middle figures is N. 
The identifications follow PDB. (t is trans and g+ is gauche+) 

According to [3], in the case of ASN and ASP the backbone oxygen atom has a special role. If so, 
it should somehow be reflected in the ensuing backbone conformations. In order to scrutinize this, 
we have inspected the orientations of all backbone O atoms in our data set, in a group of residues 
where the i-th one is located in the L-α island. The result is shown in Figure 10. We find that there is a 
very strong localization in the directions of the backbone O atoms, and this localization is residue 
independent and extends over at least four different residues. The O atoms at both the L-α 
site i and the preceding site i − 1 are highly localized into a single direction, while for both the 
site i − 2 and the site i + 1 we identify three closely located directions, which are presumably related 
to the three available trans/gauche conformations. 

Indeed, it appears that only very few backbone geometries are accessible in the vicinity of a residue 
that is located in the L-α island. Furthermore, since the regime that extends from the (i − 2)nd to 
the (i + 1)st site involves three sets of curvature and torsion angles, each of them defined in terms 
of three resp. four residues, we conclude that the backbone geometries reflect the collective interplay 
of at least seven different sites along the backbone. 







FIG. 10: The orientations of backbone O atoms around the site i that is located in the L-α island, in the Cα discrete Frenet 
frame of the i-th central Cα carbon. For the i-th and (i − 1)st atom only one position appears to be available while the (i + 1)st 
and (i − 2)nd atoms each have three available (trans/gauche) positions. The angle φ is measured from the n axis. 

To further inspect the universality of the L-α island, we consider the distribution of the backbone 
bond and torsion angles that are attached to a Cα carbon when a residue is located in the L-α 
island. The result is shown in Figure 11 separately for ASN and ASP, and for the remaining 
non-glycyl amino acids. We find no essential difference between the different residues, nor do we find any 
essential difference between the trans and g+ islands. 

FIG. 11: The (κ, τ) distributions for backbone links that are attached to a Cα carbon with residue in the L-α island, separately 
for ASN and ASP, and for the rest. In the left column, the Cα carbons in the case where the corresponding Cγ carbon is in the 
trans island; in the right column, those where the Cγ is in the g+ island. The first row is for the link that precedes either ASN or ASP. The second 
row is for the link preceding any other non-glycyl amino acid. The third row is for the link following either ASN or ASP. The fourth row is for 
the link following the others. 

Moreover, we observe the following general pattern: For the backbone Cα link that precedes the 
L-α island, three different regions on the (κ, τ) plane are probable. These are the regions that we 
have denoted by a, b and c respectively in Figure 12-left. In this Figure we have combined all the 
data that are displayed separately in parts a, b, c and d of Figure 11. After the L-α island there 
are also three different regions that are probable. We denote these regions by the letters b, d and e 
respectively in Figure 12-right, now combining the data in parts e, f, g and h of Figure 11. Note 
that the regions b in the two parts of Figure 12 practically coincide. By inspecting the protein 
structures in our data set we conclude that the presence of a residue in the L-α island causes the 
following phenomenologically verified selection rules in Figure 12: 
• The region a can only precede regions d and e. 

• Both regions b and c can be followed by any of the three regions 
b, d and e. 

Furthermore, we find that 

• The residue preceding either a or c is not located in the L-α 
island. 

• Both the residue preceding and following b can be located in 
the L-α island. 

• If the two residues following c are both in the L-α island, the 
first residue connects c to b and the second connects b to b. 

FIG. 12: The (κ, τ) distributions for all backbone links that are attached to a residue in the L-α island. On the left for the link 
preceding the residue in the L-α island, on the right following the residue in the L-α island. 
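The first two selection rules listed above can be written down as a small lookup table, which makes it easy to check a sequence of regions against them. This is only an illustrative encoding: the dictionary and function names are hypothetical, and the regions are represented by their letter labels from Figure 12 rather than by actual (κ, τ) boundaries, which the text does not specify.

```python
# Region visited by the link preceding an L-alpha residue -> regions that
# the link following that residue may occupy (labels as in Figure 12).
ALLOWED_AFTER = {
    "a": {"d", "e"},        # region a can only precede regions d and e
    "b": {"b", "d", "e"},   # b can be followed by any of b, d, e
    "c": {"b", "d", "e"},   # c can be followed by any of b, d, e
}


def transition_allowed(before, after):
    """True if an L-alpha residue may connect region `before` to region `after`."""
    return after in ALLOWED_AFTER.get(before, set())


def path_allowed(regions):
    """Check every consecutive pair of regions in a path through L-alpha residues."""
    return all(transition_allowed(p, q) for p, q in zip(regions, regions[1:]))


ok_single = transition_allowed("a", "d")        # allowed by the first rule
bad_single = transition_allowed("a", "b")       # a is never followed by b
ok_double = path_allowed(["c", "b", "b"])       # two consecutive L-alpha residues
```

The path c, b, b corresponds to the last rule in the list: when two consecutive residues after c are both in the L-α island, the first connects c to b and the second connects b to b.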

We recall that since the region b has the same curvature angle as the standard α-helix region 
and since the torsion angles are equal in magnitude but have an opposite sign, a repeated structure 
in b is the mirror image of the standard right-handed α-helix; this is truly the region of left-handed 
α-helices. 

We have also found that there appear to be four main trajectories that are followed by the 
orientations of those Cα carbons that surround a residue located in the L-α island. The result is 
shown in the schematic Figure 13, where the pink arrows correspond to residues in the island. Recall 
that the curvature and torsion angles are link variables; they connect two Cα carbons according to (2). 
FIG. 13: Four different trajectories through a residue in the L-α island that are common in our data set. In each case the pink 
line denotes the transition caused by the residue in the L-α island. The trajectory a describes a turn from the α-helix region to 
the β-strand region, and the remaining ones start and end in the β region; these include β-turns. 

In Figure 13a, the first residue takes us away from an α-helix region to the region a in Figure 
12 left (black arrow). This is followed by a residue in the L-α island, which takes us to the region d in 
Figure 12 right (pink arrow). Finally, there is a transition to the β-strand region (black arrow). 

The second trajectory in Figure 13b starts from the β-strand region with a residue that takes it 
into region c in Figure 12 left. The following residue, which is located in the L-α island, then causes a 
transition into region d in Figure 12 right (pink arrow). This is followed by a transition back to a 
β-strand region. 

The third trajectory that we have described in Figure 13c starts from the β-strand region and 
proceeds to region c in Figure 12 left. From there the trajectory proceeds to region b in Figure 12 
left, with the transition caused by a residue in the L-α island. This is followed by a transition to 
region d and then back to the β-strand region. 

Finally, the fourth trajectory that is also common in our data set is the one displayed in Figure 
13d. It is similar to the trajectory described in Figure 13c, except that now the residue that is 
located in the L-α island causes the transition from b to d in Figure 12. 

Figures 12 and 13 reveal that the presence of a residue in the L-α island relates to a collective 
topology of the backbone involving several residues. Since the definition of a bond angle takes three 
Cα carbons and the definition of a torsion angle takes four (see Figure 1), we conclude that the 
topology of the trajectories in Figures 13a and 13b involves the interplay of seven residues, while in 
13c and 13d there are a total of eight residues present. 

Finally, we have verified that all our results are independent of the data set we have used by 
similarly analyzing the proteins in the version v3.3 Library of chopped PDB files for representative 
CATH domains. The results are very similar. But in addition, we find that the propensity is largest 
in the (mainly-β) CA level classes 2.90 and 2.160, where over 5% of all residues are in the island. We 
also find that any CA level family has at least 1% of its residues in the island, except 1.40 where 
the single representative with PDB code 1PPR has no residues in the island. 

CONCLUSION 

We have investigated the non-glycyl residues that are located in the L-α region of the 
Ramachandran map. Independently of the amino acid, we find that in the Discrete Frenet frames of the Cα 
carbons the corresponding side-chain Cβ carbons are localized in the same direction. This 
universality in the orientation persists when we investigate the Cγ and Cδ carbons, and the side-chain O and 
N atoms in the case of ASN and ASP. The results suggest that instead of reflecting only a local 
interaction between a given backbone unit and its residue, the L-α island is associated with a largely 
residue independent backbone conformation that involves the collective interplay between several 
consecutive residues. 

When we proceed to analyze the distribution of those backbone bond and torsion angles that are 
associated with the links that both precede and follow a residue that is located in the L-α island, we 
find that independently of the residue these angles display very similar patterns. Since the definition 
of a bond angle takes three Cα carbons and the definition of a torsion angle takes four, this prompts 
us to propose that the geometrical structure associated with the presence of a residue in the L-α 
island reflects the interplay of at least seven consecutive backbone units. In particular, we have 
not been able to pinpoint any obvious local reason (charged, polar, acidic, hydrophobic/philic) to 
explain the presence or absence of a residue in the L-α region. 

Our approach is based on a novel method to depict proteins. In the course of our analysis we have 
been able to observe several systematic patterns, including anomalies in the PDB data, suggesting 
that the method we have utilized has the potential to become a valuable tool for both experimental 
and theoretical protein structure analysis and fold prediction. 